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ABSTRACT 

Motivation: High-throughput sequencing of tumor samples has 
shown that most tumors exhibit extensive intra-tumor heterogeneity, 
with multiple subpopulations of tumor cells containing different som- 
atic mutations. Recent studies have quantified this intra-tumor hetero- 
geneity by clustering mutations into subpopulations according to the 
observed counts of DNA sequencing reads containing the variant 
allele. However, these clustering approaches do not consider that 
the population frequencies of different tumor subpopulations are 
correlated by their shared ancestry in the same population of cells. 
Results: We introduce the binary tree partition (BTP), a novel com- 
binatorial formulation of the problem of constructing the subpopula- 
tions of tumor cells from the variant allele frequencies of somatic 
mutations. We show that finding a BTP is an NP-complete problem; 
derive an approximation algorithm for an optimization version of the 
problem; and present a recursive algorithm to find a BTP with errors in 
the input. We show that the resulting algorithm outperforms existing 
clustering approaches on simulated and real sequencing data. 
Availability and implementation: Python and MATLAB implementa- 
tions of our method are available at http://compbio.cs.brown.edu/ 
software/ 

Contact: braphael@cs.brown.edu 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 

1 INTRODUCTION 

Cancer is a disease driven by somatic mutations that accumu- 
late in the genome during the Hfetime of an individual. 
High-throughput sequencing technologies now provide an un- 
precedented ability to measure these somatic mutations in 
tumor samples (Ding et aL, 2013). Application of these technol- 
ogies to cohorts of cancer patients has revealed a number of new 
cancer-causing mutations and cancer genes (Kandoth et aL, 
2013; Lawrence et aL, 2013; Vogelstein et aL, 2013). Cancer 
sequencing studies have also demonstrated that most tumors ex- 
hibit extensive intra-tumor heterogeneity characterized by individ- 
ual cells in the same tumor harboring different complements of 
somatic mutations (Ding et aL, 2012; Gerlinger et aL, 2012; Nik- 
Zainal et aL, 2012; Schuh et aL, 2012; Shah et aL, 2012). Such 
heterogeneity is a consequence of the fact that cancer is an evo- 
lutionary process in a population of cells. The clonal theory of 
cancer evolution (No well, 1976) posits that the cells of a tumor 
descended from a single founder cell. This founder cell contained 
an advantageous mutation leading to a clonal expansion of 
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a large population of cells descended from the founder. 
Subsequent clonal expansions occur as additional advantageous 
mutations accumulate in descendent cells. A sequenced tumor 
sample thus consists of multiple subpopulations of tumor cells 
from the most recent clonal expansions (Fig. 1). 

Nearly all cancer sequencing efforts thus far sequence DNA 
from a single sample of a tumor at a single time. This is because 
of technical limitations: the most cost-effective DNA sequencing 
technologies (e.g. lUumina) require input DNA from many 
tumor cells, and such samples are typically available only when 
patients undergo surgery (Occasionally, paired samples from two 
time points, such as a primary tumor and a metastasis, are also 
sequenced.). While sequencing of multiple samples from the same 
tumor (Gerlinger et aL, Newburger et aL, 2013; Salari 

et aL, 2013) or single-cell sequencing (Hou et aL, 2012; Navin 
et aL, 201 1; Xu et aL, 2012) might eventually provide even better 
datasets to assess intra-tumor heterogeneity, technical consider- 
ations have limited their applicabihty utility thus far. Thus, there 
is tremendous interest in methods that infer the relative propor- 
tion of different subpopulations of tumor cells in a single sample. 

Several recent studies (Ding et aL, 2012; Nik-Zainal et aL, 
2012; Shah et aL, 2012) have demonstrated that it is possible 
to infer the subpopulations of tumor cells by counting the 
number of DNA sequence reads that contain a somatic muta- 
tion. For single-nucleotide mutations, or variants, the variant 
allele frequency (VAF) is defined as the fraction of DNA 
sequence reads covering the variant position that contains the 
variant allele rather than the reference/germ line allele. The VAF 
provides an estimate of the fraction of tumor chromosomes con- 
taining the mutation, but with error due to the stochastic nature 
of the sequencing process (Fig. 1). In addition, technologies cur- 
rently used in cancer sequencing studies produce short reads that 
rarely contain more than one somatic mutation. Thus, for any 
pair of somatic mutations, the only information available to 
distinguish their subpopulation of origin is the VAF. 

To overcome substantial variability in measured VAFs, a 
common approach is to cluster VAFs and from these clusters 
infer the number and proportion of various subpopulations of 
tumor cells in the sample. A number of techniques have been 
introduced to perform this clustering, with Dirichlet Process 
Mixture models and related non-parametric models being par- 
ticularly popular, as they do not fix the number of clusters in 
advance (Miller et aL, forthcoming; Nik-Zainal et aL, 2012; Shah 
et aL, 2012). VAF clusters correspond to tumor subpopulations, 
and the cellular fraction, or fraction of tumor cells containing 
the cluster of somatic mutation, is derived from the VAF of 
the clusters. In the simplest case, the VAF directly determines 
the cellular fraction: e.g. a cluster with VAF = 0.5 corresponding 
to homozygous mutations in 50% of tumor cells, or 



© The Author 2014. Published by Oxford University Press. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ 
by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial 
re-use, please contact journals.permissions@oup.com 



A combinatorial approacli for analyzing intra-tumor heterogeneity 




Time (celP divisions) 




33.75% 



26.25% 



40% 



sequencing tlie sample & 
obtaining variant allele frequency data 



estimating subpopulation 
frequencies 




Variant Allele Frequency (VAF) 



70 



80 



90 



Frequencies of Mutations 




Fig. 1. (a) The cells in a tumor descend from a single founder cell via multiple waves of clonal expansion. Each circle represents a population, each dot 
corresponds to a mutation and the shaded sections indicate the cells descended from each founder cell of the clonal expansion, (b) Under mild 
assumptions, these clonal expansions give rise to a BTP, with nodes representing populations of tumor cells with specific subsets of somatic mutations, 
(c) The variant allele frequencies (VAFs) of somatic mutations are determined from sequencing data and used to infer tumor subpopulations and/or the 
BTP. Here the clusters correspond to clonal expansions, and center of each cluster estimates the frequency of the newly formed subpopulation (denoted 
with colored marker). Note that the size of each cluster depends on the number of mutations that accumulate before its expansion 



heterozygous mutations in 100% of tumor cells. However, in 
practice, the inference of cellular fraction is complicated by 
copy number aberrations and normal admixture (percentage of 
the tumor sample that is normal cells), and these two factors are 
themselves correlated (Carter et aL, 2012; Oesper et al., 2013; 
Strino et ai, 2013). We will not consider such comphcations 
here; rather we restrict attention to heterozygous somatic muta- 
tions outside of copy number aberrations; assumptions that were 
made in sequencing studies including Ding et al. (2012). 

With two exceptions (Jiao et ai, 2014; Strino et al., 2013), 
current techniques for clustering VAFs treat each subpopulation 
independently and do not consider that these frequencies are 
correlated by the fact that they are partitions of the same cellular 
population. Thus, these approaches do not exphcitly construct 
an evolutionary history of the accumulated somatic mutations in 
the cells. Strino et al. (2013) performs a heuristic search over 
possible trees, and we discuss Jiao et al. (2014) below. 

Contributions. In this article, we formulate the problem of infer- 
ring subpopulations of tumor cells from VAF data obtained from 
a single tumor sample as the combinatorial problem of construct- 
ing a Binary Tree Partition ( BTP). We show that the problem of 
finding a BTP is NP-complete and present a | — o(l)-approxima- 
tion algorithm for a related max -BTP problem. The approxima- 
tion algorithm is based on Local Search and we use the 
well-known packing bound of Hurkens and Schrijver (1989) for 
the purpose of analysis. Next, we define e-BTP, a generalization of 
the BTP problem that allows for the possibihty that VAFs are 
observed with errors and some VAFs are not observed. We pre- 
sent a straightforward recursive algorithm to find an e-BTP and 



show that this algorithm outperforms existing VAF clustering 
approaches on simulated and real data. This recursive algorithm 
is fast in practice and runs in less than a minute, on a single CPU, 
for each of our simulated or real samples. 

2 TUMOR SUBPOPULATIONS AND THE BTP 

In this section, we formulate the problem of determining tumor 
subpopulations from VAF clusters. The clonal theory of cancer 
evolution proposes that the cancerous cells in a tumor are the re- 
sult of multiple waves of somatic mutation and clonal expansion. 
Given the relationship between sequence coverage (1000- 
10 000 X for targeted studies) and number of tumor cells that 
are sequenced in a tumor sample (millions), we first assume 
that any somatic mutation reported in the data must be present 
in an appreciable fraction of tumor cells. This imphes that the 
observed somatic mutations were present in at least one clonal 
expansion. Second, as in other recent studies (Jiao et al, 2014; 
Salari et al, 2013), we assume that somatic mutations follow the 
infinite sites assumption such that at most one single mutation 
occurs at a genomic locus (e.g. single position) during the evolu- 
tion of the tumor. It follows from this assumption that if a mu- 
tation ^ occurs in a tumor cell subsequent to a mutation a, then 
the fraction of cells containing mutation a must be at least as 
large as the fraction of cells containing p. This condition was also 
recently noted in Jiao et al. (2014). 

We assume that at any particular time in the cancer progres- 
sion at most one cell in the tumor population acquires a new 
mutation leading to a clonal expansion. We emphasize that 
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this assumption restricts only the number of clonal expansions 
that begin at a given time and not the number of clonal expan- 
sions that are ongoing at one time. Under this assumption, each 
clonal expansion splits the present tumor cell population into 
exactly two subpopulations: the subpopulation P of cells con- 
taining the newly acquired somatic mutation and the subpopula- 
tion P' of cells without the mutation (and possibly with a 
different set of new mutations occurring later in time). Thus, 
the ancestral history of the sequenced tumor cell population is 
represented by a rooted binary tree with nodes corresponding to 
populations of cells at each clonal expansion, and edges indicat- 
ing the ancestral relationships between these populations. 
Consequently, each node v in the tree has a set of somatic 
mutations that accumulate along the unique edge that con- 
nects V to its parent (As the root r does not have a parent, the set 
Mr represents the somatic mutations that accumulate before the 
first clonal expansion. Alternatively, we can let Pr be a hidden 
edge that connects the founder cell of the tumor (represented by 
the root r) to the normal cell from which it is derived.). Each 
node V also has a frequency representing the proportion of 
sequenced tumor cells with mutations (Fig. 1). Note that 
most tumor samples contain admixture by normal cells with 
no detectable somatic mutations, and thus in general < 1 . 
A consequence of these assumptions is that every internal node 
V in the tree satisfies the children sum to parents ( CSP) condition: 
^uc\moiv^u = civ We make the following definition: 

Definition 1 . . Given a multiset C = {a\, . . . , an) with 0 < <2/ < 1 , a 
BTP for £ is a complete rooted weighted binary tree T = (K,£) 
with nodes V= {vi, . . . , v„} such that V/ has weight at and every 
internal node satisfies the CSP condition. 

Figure 1 shows a simple example of a complete binary tree in 
which the CSP conditions are satisfied for every internal node 
(also see Supplementary Appendix D.6 for a more general ex- 
ample). Recall that a complete rooted binary tree is a binary tree 
wherein there is a unique node of degree 2 (the root), and every 
node in the tree is either a leaf or has exactly two children. Our 
goal is to construct such a tree from the measured VAF data. We 
define the following. 

Definition 2 (BTP problem). Given a multiset C = {a\, . . . , an], 
find a BTP for C if one exists. 

Note that in some cases, at the split defined by a clonal 
expansion, cells from P' may survive to the present, without 
undergoing additional clonal expansions (e.g. such cells may 
cease dividing, or senesce). In this case, there are no mutations 
that exclusively occur in P' . We discuss this case in Section 5 
below. 



3 COMPLEXITY OF THE BTP PROBLEM 

In this section, we outline the proof of the following theorem. 

Theorem 3. The BTP problem is NP-complete. 

The proof of this theorem relies on the idea of finding a set of 
conflict-free triangles in the multiset C. This idea is also useful 
below for deriving an approximation for a related problem of 
finding a max -BTP, and so we now define the relevant concepts. 



Suppose £ = {<2i, . . . , a„} is a multiset of n elements. For any 
distinct /, j and k such that ai + aj = a^, we define the ordered 
pair (k, {i,j}) as a triangle in the multiset. See Supplementary 
Appendix A.l for an example. We call k as the peak of the tri- 
angle and /, j as the tails of the triangle. 

We say that triangles t = (k, {i,j]) and f = (k\ {i\f]) are in 
conflict k = k' or [ij\ fi {/',/} 7^ 0. In other words, two tri- 
angles are in conflict if and only if they either share a common 
peak or a common tail. A set Z of triangles is conflict-free if no 
pair of triangles in Z are in conflict. If 7" is a BTP for £, for each 
internal node a^ and its children and aj, we have a^ = ai + a^ 
by CSP. Therefore, {k, \i,j\) is a triangle in £, and thus, T cor- 
responds to a set of conflict-free triangles in £: each internal 
node and its children form a triangle in £, and no two triangles 
share a common peak or a common tail. 

Because the number of nodes in a complete binary tree is 
always odd, \C\ being an odd number is a necessary condition 
for the existence of a BTP for C. For a multiset £, with 
|£| = 2g— 1, the size of a conflict-free set of triangles is at 
most q—\. This is because each triangle has exactly two tails, 
and in a set of conflict-free triangles, all the tails must be distinct 
elements. 

We have the following theorem, whose proof is in 
Supplementary Appendix A.2. 

Theorem 4. Suppose C = [a\, . . . , a2q-\]. C has a BTP if and 
only if there exists a set of q — 1 conflict-free triangles in C. 

The proof of NP-completeness of the BTP problem (Theorem 
3) is by a reduction from the Numerical Matching with Target 
Sums (NMTS) problem, (Garey and Johnson, 1979). An instance 
of NMTS is a triple X = (X, Y,B) where X, r,5cZ+, and 
X={xi,...,x,n}, Y=[yx,...,ym] and B= {bi, . . . ,b,n]. The 
goal is to find two permutations ttx and 717 on {1, . . . , m} such 
that Xj,^^i)+yj,y^{) = bi for z = 1 . . . ,m. 

Theorem 5. Let I = (X, Y, B) be an instance of NMTS. Then, X 
has a solution if and only if a particular multiset Cx has a BTP 
( i.e. a set of 2m — 1 conflict-free triangles ) . Moreover, each solu- 
tion ofl can be obtained (in polynomial time) from a BTP of C, 
and vice versa. 

Proof. For a given instance X = {X, Y,B), let x/ = 4(m+l)x/ 
+ \, yi = 4(m+\)yi + 3 and bi = 4(m+ \)bi + 4, \ < i < m. 
Moreover, let = y^'^ _^ bk, 2 <j < m. Now we construct an 
instance Cj of BTP (i.e. a multiset), with 4m — \ elements 
as follows: Cj = {xi, Xm] ^ {y i, 9^} ^ i^i, bm}^ 
A^}- 

(^) First assume that we have two permutations ttx, ttj on 
{1, . . . , m} such that Xj^^(j) +yjtYii) ^ definition of x/, y^ and 

bi we have also Xj^^ij) +y^^(j^ = bi. Now we construct a set of con- 
flict-free triangles for Cx of size 2m — 1: for each i e {1, . . . , m}, 
add the triangle {bi,{Xj^^(i^,yj^^(^^}). In addition, for each 
/ G { 1 , . . . , m — 1 }, we add all triangles (A+ ^ , {bi, bi+ 1 }). 
Thus, by Theorem 4 we obtain a BTP from ttx and Jiy in poly- 
nomial time. 

(<^) Suppose ^ is a set of 2m — 1 conflict-free triangles for d. 
We claim that for each Xi a triangle (^y(/), {xi, yad))) exists, where 
a and y are two permutations on {\,...,n}. Note that this 
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completes the proof, by taking 7Tx= and 7TY = (y~^oa) for 
the instance I. Node x/ cannot be the root, as the largest number 
in Cj is ft,,. Therefore, has a sibling s and a parent p. For all 
i e {1, . . . , m], we have x/= 1, yi = 3, bi = 4 and ft = 4/, all in 
mod 4(m + 1). Thus, if Xi + s=peCx, we have ^ = 3 
(mod4(m+l)) and /> = 4(mod 4(m+ 1)). This implies ^ = >'a(/) 
and p = by(i) for some a(0 and y(i). Finally, because all the elem- 
ents of Cx are presented uniquely in T, a and y are two permu- 
tations on {1, ... , m). Note that we construct ttx and tty from T 
in polynomial time, and the proof is complete. □ 
We note that the proof above shows that the BTP problem 
in NP-complete in the strong sense, i.e. the problem is still NP- 
complete if the elements of the multiset are polynomially 
bounded. In addition, the analogous partition problem for 
non-binary trees is also NP-complete by reduction from the 
subset sum problem. See Supplementary Appendix A.6. 

4 A I - 0(1) APPROXIMATION FOR MAX-BTP 

In the previous section, we showed that for a given multiset C of 
2q — 1 elements, each BTP for C corresponds to a collection of 
q—\ conflict-free triangles. Because C can have at most q—\ 
conflict-free triangles, we define the max -BTP problem to be the 
problem of finding the maximum sized set of conflict-free tri- 
angles. This is a closely related problem to the BTP problem: 
in the context of VAF data, the maximum sized set of conflict- 
free triangles denotes partial information about the ancestral re- 
lationships among mutations. Moreover, for a multiset of m 
elements if max -BTP has a solution of size A then a BTP with 

= m — 1 — 2A additional nodes can be found. See 
Supplementary Appendix A.7. 

We derive a | — o(l) approximation algorithm for the max - 
BTP problem for C. The algorithm is based on Local Search. We 
start with any collection of conflict-free triangles in £ as a solu- 
tion and iteratively add another triangle as follows. For a fixed 
constant t>\, we iteratively replace any subcollection of ^ < ^ 
triangles in the solution with ^ + 1 triangles of C such that the 
new collection still contains only conflict-free triangles. 

It is easy to see that the above local search terminates in poly- 
nomial time. Let OPT be the size of the optimal solution. 
Because we cannot have more than q — 1 conflict-free triangles 
in a solution, OPT<q— 1. After each iteration, the size of the 
collection increases by 1 , and because t is a constant, the search 
procedure at each iteration is polynomial time. Similar to the 
technique that was used in Hajirasouliha et al. (2007), we use 
the packing bound of Hurkens and Schrijver (1989) to prove the 
following theorem (Proof in Supplementary Appendix A.7). 

Theorem 6. There exists a polynomial time algorithm that 
gives an approximated solution to the problem of finding max- 
imum set of conflict-free triangles within a factor of ^ — 8 for 
any 8>0. 



5 THE e-BTP PROBLEM 

Typically on real data, a BTP will not exist — either because the 
frequencies are determined with some error or the VAF data 
does not capture the frequency of a subpopulation that does not 



have mutations that exclusively occur in that subpopulation 
(VAFs provide information only about the proportion of cells 
with a mutation, and do not provide information about propor- 
tions of cells that have a specific mutation and lack another 
mutation.). In this section, we introduce the e-BTP to account 
for these scenarios. Suppose we have the multiset C = 
{a\, . .. ,afn] of observed frequencies and a corresponding VAF 
error vector s = (si,... ,£m) for C, where is the maximum pos- 
sible error in observing a^ for \<i<m. To account for subpo- 
pulations without distinguishing mutations, we may need to add 
auxiliary frequencies to C that correspond to the missing sub- 
population frequencies. We make the following definitions. 

Definition 7 (e-BTP). Given a multiset C = {a\, . . . ,anri with 
associated VAF error vector e = {e\, . . . ^e^), an £-BTP with 
^>0 auxiliary nodes is a BTP for a multiset C = 
{a\, a2, .. . a^ + k) such that for all i <m : \ai — ai\ < £/. We call 
the nodes (2^+ 1, . . . ,ary, + k the auxiliary nodes of the e-BTP. 

Definition 8 (The e-BTP problem). Given a multiset C and an 
associated VAF error vector s, find an e-BTP of C with min- 
imum number of auxiliary nodes such that two auxiliary nodes 
are not siblings. 

The constraint on auxiliary nodes in the definition of e-BTP 
problem follows from the assumptions in our model of cancer 
progression: each branching in the cancer progression happens 
only when at least one clonal expansion starts. So, the VAF data 
captures the frequency of the newly formed subpopulation (see 
Section 2). Thus, at least one of the children of the current sub- 
population node is not an auxihary node. 

It is straightforward to show that for any multiset C of size m, 
it is always possible to obtain an e-BTP with k = m—\ auxiliary 
nodes (proof in Supplementary Appendix A. 3). Also, when 
Si = 0 for all 1 < / < m, a BTP exists for C if and only if the 
corresponding e-BTP has a solution with k = 0 auxiliary nodes. 

To outline our algorithm, we need the following definitions. 

Definition 9 (e-CSP tree). Given a VAF error vector s, an 
e-CSP tree is a (weighted) binary tree, such that for each internal 
node cii we have aj + cik e ± (£/ + ej + €k)], where aj and at are 
the children of 5/. 

We say that an e-CSP tree T for a multiset C is acceptable if we 
can obtain a BTP, a(T), by replacing each a/ by a value ai where 
Wi — ^i\ < ^i- Note that a(T) is an e-BTP for C. Also, note that 
an e-CSP tree is not necessarily acceptable (See Supplementary 
Appendix D.8). However, one can easily check whether a given 
e-CSP tree T is acceptable by finding a collection of e/s, where 
k/l < ^i, satisfying the following constraints: («/ + ei) = (aj + ej) + 
(ak + ek), for each internal nodes a, and its children aj, ak- This 
can be easily done via a linear program, which we denote by 
LP(f). 

Our Rec-BTP algorithm (Algorithm 1) uses a recursive 
method that works as follows: at each recursion during the al- 
gorithm, we have (i) a partially constructed e-CSP tree T, (ii) a 
multiset of remaining frequencies C and (iii) the number of re- 
maining auxiliary nodes that we are allowed to use. We check if 
T can be extended by attaching two elements of £, or one elem- 
ent from C and an auxiliary node, to one of the leaves in T (we 
assign the auxiliary node's weight accordingly). If £ is empty, it 
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means the algorithm has constructed an e-CSP tree. So we 
output |a(r)| if LF(f) has a feasible solution. Finally, Rec- 
BTP outputs all the e-BTPs. Iterating over all values of k from 
0 to m — 1, the algorithm will find the smallest k such that there 
exists an e-BTP. 

Later in Section 6, for the purpose of benchmarking our re- 
sults, in case of multiple e-BTP outputs, we choose only the tree 
whose Hst of node frequencies has the minimum root mean 
square deviation (RMSD) from the original VAFs data (defined 
below in Section 6). 



6 EXPERIMENTAL RESULTS 

6.1 Simulated data 

We generate simulated mutation data from all complete rooted 
binary trees with 3, 5, 7 and 9 nodes (Supplementary Appendix 
D.7). For each tree topology, we generate 1000 random BTPs by 
assigning a weight <2/ to each node /, as follows. For the root r, we 
set Gf. = \, assuming that our tumor sample is pure, i.e. is not 
contaminated with normal cells. Next, we proceed down the tree: 
for each parent with weight (2/, we select a pair of real numbers aj 
and for the children uniformly at random such that the CSP 
condition (<2/ = aj + at) is satisfied. Finally, we generate a set M/ 
of somatic mutations for each node, with |M/| selected uniformly 
from [50 400] independently for each node. We assume all som- 
atic mutations in the set M/ happened independently because the 
parental cell was created. For each such BTP and set of muta- 
tions, we generate a VAF data corresponding to each tumor 
subpopulation. Ideally, the VAF of a tumor subpopulation 
from node v equals a^. However, because the observed frequen- 
cies are estimated from alignments of sequencing data, the 
observed frequencies will deviate from the true values. We 
assume that the observed VAFs for mutations of a subpopula- 
tion V are normally distributed with mean a^ and standard devi- 
ation or. Specifically, for each node v, let be a set of \My\ 
samples from Mia^, g). Here, we present the results when 
(7 = 2, with results on c = 1, 4 in Supplementary Appendix B. 
The VAF data for the tumor sample is thus X= UyXy. Note that 
we simulate VAFs directly rather than the number of mapped 
reads containing a mutation. Because we assume that our se- 
quence coverage is high (>1000x), it follows that the correspond- 
ing binomial (or negative binomial) distribution of read counts is 
well approximated by the normal distribution. For lower cover- 
ages, the asymptotic normal approximation may not model the 
data as accurately; nevertheless, the simulations provide a com- 
parison of the different methods on high-quality data. 

For each simulated VAF dataset X, we estimate the number 
and frequencies of tumor subpopulations using two non- 
parametric clustering algorithms: (i) Accelerated variational 
Dirichlet process Gaussian mixture model (AVDPM; Kurihara 
et ai, 2006), as implemented in https://sites.google.com/ 
site/kenichikurihara/academic-software/variational-dirichlet- 
process-gaussian-mixture-model, which is a general clustering 
method that we apply directly on VAF data, and (ii) SciClone 
(Miller et al., forthcoming), which is a recent algorithm (with 
available software but no published paper) that estimates 
tumor composition from VAF data by clustering the data 
using a mixture of Gaussian model. Parameter settings for 



each method are given in Supplementary Appendix C. Also, be- 
cause SciClone runtimes were extremely long, we down sampled 
the mutation data by a factor of 20. For each of our synthetic 
dataset, we implanted the fraction of the mutations (together 
with their corresponding VAF) on a synthetic chromosome 
with neutral copy number compatible with the SciClone input 
format (See Supplementary Appendix C). We ran SciClone with 
default parameters on each dataset and then extracted the means 
of reported clusters from SciClone. 

Algorithm 1: Rec-BTP(£, f , e, /cmax) 

Input : Partially constructed tree T, remaining 

frequencies multiset jC, e, and an upper bound 
/cmax on the number of remaining auxiliary nodes. 

Output: List of all e-BTPs that can been obtained by 

expanding T using at most A:max auxiliary nodes 
and the elements of C. 



1 begin 



9 

10 

11 

12 
13 

14 
15 

16 
17 



ifC = 0 then 

if LP{T) has a feasible solution then 
^ return |a(f)|; 

else 

1^ return 0; 

O^0; 

for each leaf at in T do 
for Vtti G £ do 

for each dj ^ CD /g.+e^+et (^t — di) do 
^ attach di and dj to dt ; 
C ^ C — {di, dj}; 
O^OU Rec-BTP(L^ T\ e, /c^ax) 

if /cmax > 0 then 

T' ^ attach di and an auxiliary node 
(with weight dt — di) to dt; 
a ^ C- {a^}; 

O^Oy^ Rec-BTP(/:^ r , e, /cnaax - 1) 
return 0\ 



From the output of each clustering algorithm, we obtain the 
input for our Rec-BTP algorithm. We compute the sample mean 
cLi and standard deviation oi for each cluster Q, and set the VAF 
error for the subpopulation frequency of cluster Q equal 
to 1.96 -c-^, where c is a constant set to 3. Note that 
1.96 ■ <7//VIQi is the radius of the empirical 95% confidence 
interval in estimating the true subpopulation frequency. 

We set £ = {«!,..., 5^} and £ = (ei, . . . , e^) as the input 
to Rec-BTP. We find the minimum k for which there exists a 
e-BTP for C with k auxiliary nodes. In many cases, there are 
multiple BTPs for C with exactly k auxiliary nodes. In these 
cases, we select a single BTP with the minimum cost 
costx(^ = J Z]xeA 1-^ ~ ^'1^' where s = n — k, X is the 
VAF dataset and Di = {x e Z|argminy|(2y — x\ = i} is the subset 
of elements of X that are closest to a^. 

We compare the results of our Rec-BTP algorithm (applied to 
clusters from both AVDPM and SciClone) to the original 
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clusters output from these algorithms over the 1000 randomly 
constructed BTPs for each of the seven tree topologies. We use 
two measures to compare the estimated subpopulation frequen- 
cies: (i) the number of subpopulations and (ii) the RMSD be- 
tween the set of estimated subpopulation frequencies and the 
true subpopulation frequencies. 

Number of subpopulations. Figure 2 shows that Rec-BTP out- 
puts the correct number of clusters more frequently than the 
clustering methods. For trees with 3 and 5 nodes, Rec-BTP 
does not improve the AVDPM clusters much. However, with 
larger number of nodes the advantage of Rec-BTP grows. Rec- 
BTP provides a large improvement over the SciClone clusters. 

We further examined the scenarios where each algorithm 
reported the correct and incorrect number of clusters. Figure 3 
compares the fraction of cases where the clustering method and 
the Rec-BTP assisted method report too few, the correct number 
or too many clusters. We see that most of the cases where Rec- 
BTP reports the correct number of cases (blue squares) are those 
where the clustering algorithm reported too few clusters and the 
Rec-BTP algorithm created additional clusters. Only in the case 
of 3 and 5 nodes do AVDPM and SciClone determine the correct 
number of clusters (green squares) in an appreciable fraction of 
cases. Overall, we see that the clustering methods tend to under- 
estimate the correct number of clusters (first columns in each 
table in Fig. 3). In a significant fraction of these cases, Rec- 
BTP adds auxiliary nodes to obtain the correct number of clus- 
ters, although this becomes more difficult with larger trees. We 



■ ftee-BTP {on AVDPM} AVDPM 




T3 T5 T7A T7B T9A TSB T9C 



Fig. 2. Percentage of trees where each algorithm finds the correct number 
of subpopulations. Light blue/red bars are AVDPM and SciClone, re- 
spectively. Dark blue/red bars are Rec-BTP results using AVDPM and 
SciClone clusters, respectively, as input 



also see that SciClone tends to underestimate the number of 
subpopulations more frequently than AVDPM. 

Accuracy of the subpopulation frequencies. We compare the 
estimated population frequencies and true population frequen- 
cies for each method using the RMSD. Suppose are 
the true subpopulation frequencies, and a\ > . . . > a^ are the 
estimated subpopulation frequencies. If m = n, the RMSD is 

y^y^^"^^ (aj — ajf . If m ^ n, we add zeros to the shorter se- 
quence so the two sequences have equal length. The zeros reflect 
the fact that we have not estimated the frequencies of some 
subpopulations. 

Table 1 gives RMSD for AVDPM, SciClone and Rec-BTP 
built from these clusters. Specifically, we provide the RMSDs 
when AVDPM or SciClone give (i) the same and (ii) less than 
the correct number of subpopulations. For some tree topologies, 
there is no sample in which both Rec-BTP and AVDPM/ 
SciClone give the same number of subpopulations equal to the 
true number of subpopulations, which we denote by N/A. 

In cases where both methods (Rec-BTP and a clustering algo- 
rithm) return the same number of subpopulations, they have 
similar performance in estimating the subpopulation frequencies. 
Also, as mentioned earher, these results are for the simulated 
input data in which the variant allele frequencies deviate 
from their true value with standard deviation a = 2. When 
(7=1 Rec-BTP performs even better. 

6.2 Comparison with Phylosub 

As noted in the introduction, Jiao et al. (2014) is a recent method 
that clusters VAF frequencies using a tree constraint. In particu- 
lar, Jiao et al. (2014) replace the Dirichlet process mixture for 
clustering with a Bayesian non-parametric prior over trees sat- 
isfying a weak form of the CSP constraint. We compared our 
Rec-BTP algorithm with PhyloSub. 

We generated VAF data from a collection of 400 random 
complete binary trees with three, five, seven and nine nodes 
with fixed topologies. For each tree, we generated 100 random 
instances. In contrast to the simulations in the previous section, 
here we used only one of the two topologies for trees with seven 
nodes and only one of the three possible topologies for trees with 
nine nodes. We converted each random simulated VAF data to a 
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Fig. 3. Estimating the number of subpopulations using different algorithms, (a) Each entry in the table represents the fraction of random trees obtained 
from AVDPM (columns) and Rec-BTP on AVDPM clusters (rows), (b) Results for SciClone versus Rec-BTP on SciClone clusters, (c) Interpretation of 
each entry: the reported number of subpopulations by each method compared with the true number of subpopulations 
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Table 1. Mean and standard deviation of RMSD over 1000 trees for each method 



Tree Rec-BTP = AVDPM Rec-BTP > AVDPM Rec-BTP = SciClone Rec-BTP > SciClone 



Rec-BTP AVDPM Rec-BTP AVDPM Rec-BTP SciClone Rec-BTP SciClone 



73 


0.9 ±0.8 


0.7±1.1 


1.3 ±0.6 


45.1 ±11.3 


1.2 ±0.4 


1.5±0.4 


2.6 ±1.1 


43.7 ±11 


T5 


6.9 ±5.6 


7.5 ±5.2 


9 ±12.8 


17.4±14.1 


1±0.5 


1.2 ±0.4 


4.2 ±6.4 


15.7 ±9.2 


TIa 


8.6 ±4.8 


8.2 ±4.5 


9.2 ±5.9 


10.1 ±6.5 


N/A 


N/A 


6.9 ±7.7 


12.8 ±7.5 


TJb 


8.4±2.8 


8.1 ±2.4 


8.8 ±6.2 


11.7±7.1 


N/A 


N/A 


6.2 ±6.9 


13.2 ±6.8 


V9a 


N/A 


N/A 


8.9±4.7 


8 ±4.6 


N/A 


N/A 


6.4 ±5.4 


10.8 ±5.5 


V9b 


2±0 


2.2 ±0 


8.1 ±4.5 


8.3 ±4.4 


N/A 


N/A 


5.6 ±4.9 


11.4±5.2 


V9c 


N/A 


N/A 


7.5 ±4.5 


9.9 ±4.4 


N/A 


N/A 


6.1 ±5.1 


10.1 ±4.5 



Note: Bold face text indicates best performance. 



PhyloSub input identical to the procedure performed for the 
simulations in the PhyloSub paper. In creating PhyloSub 
inputs, we assumed the total read counts for every single nucleo- 
tide variants (SNV) position is 10000 (i.e. an ideal uniform 
coverage of lOOOOx for the PhyloSub input) and assumed 
every SNV is heterozygous. On each dataset, we ran the 
Markov chain Monte Carlo (MCMC) method of PhyloSub 
100 times, each with 5000 MCMC iterations as per Jiao et al. 
(2014), and used the reported top trees (i.e. those trees with best 
log likelihood) in our comparison. 

We found that PhyloSub produced trees with many more 
nodes than the simulated value, and significantly more than 
Rec-BTP or SciClone (Fig. 4 and Supplementary Appendix D. 
9]). Because PhyloSub usually tends to report trees with a higher 
number of nodes, we also considered the size of the smallest tree 
reported by PhyloSub in their provided list of top trees for each 
input. While this value was smaller, it was still much larger than 
the true value or the values from the other approaches. 

The large number of clusters produced by PhyloSub might 
result from the fact that the method does not assume that the 
trees are binary. The output trees contain many different topol- 
ogies. Nevertheless, it is surprising that PhyloSub does not find 
binary trees when the data are produced from this topology. As 
PhyloSub reports a higher number of nodes than Rec-BTP, we 
were unable to directly compare the provided frequencies of the 
clusters of each method. 

6.3 Acute myeloid leukemia sequencing data 

We tested our algorithm on VAFs obtained from deep read 
counts information for SNVs from an acute myeloid leukemia 
sample (AML1/UPN933124) using data from Ding et al. (2012). 
We used the 386 SNVs reported in the primary AML sample, 
obtaining the tumor VAF data directly from Supplementary 
Table S5a in Ding et al. (2012). Note that Ding et al. (2012) 
also report data from a relapse sample following chemotherapy. 
As the relapse-specific mutations form only one cluster, we do 
not analyze the BTP problem for this sample in our study. 
Nevertheless, the generalization of the e-BTP problem for the 
case where the input data contains two types of VAFs (e.g. 
both tumor- and relapse-specific mutations) is an interesting 
open problem. We first ran SciClone on the VAF data, obtaining 
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Fig. 4. A violin plot for the number of clusters output by Rec-BTP, 
SciClone, PhyloSub where the top tree is considered, and PhyloSub 
where the tree, among the top ones, with the minimum number of 
nodes is considered. The j-axis shows the number of nodes in each 
tree, while the histogram in each violin plot corresponds to tree sizes of 
different experiments 



four distinct VAF clusters with means of 47.17, 33.17, 22.42, 
3.65%. We then ran Rec-BTP on these clusters, fixing the root 
of the BTP as an additional node with frequency 100%, reflect- 
ing the fact that the tumor sample was pure and started from a 
single founder clone (Ding et al, 2012). 

Figure 5 shows the resulting e-BTP with corresponding multi- 
set of population frequencies: Crcc-btp^ {100, 53.75, 46.25, 
42.86, 32.25, 21.5, 3.39}. Ding et al. (2012) present a history 
of clonal expansions that implies an e-BTP for the multiset 
{100, 53.12, 12.74, 29.04, 5.1} with two auxiliary nodes 
(Fig. 4b). There are two such e-BTP, depending on the relative 
order of two clonal expansions (Fig. 4), one with subpopulation 
frequencies: Ci = {100, 53.12, 46.88, 12.74, 34.14, 29.04, 5.1} and 
another with frequencies £2= {100, 34.14, 65.86, 53.12, 12.74, 
29.04, 5.10}. However, because the frequencies reported in Ding 
et al. (2012) are scaled according to the estimated 93.72% purity 
of their sample, it is necessary to multiply the frequencies in C\ 
and C2 by 0.9372 before comparing with the clusters obtained 
from the VAF data. 

We compare these different subpopulation frequencies esti- 
mates by computing the average ii norm between the VAF for 
each mutation and the closest subpopulation. Table 2 shows this 
measure for each of the subpopulation multisets. We see that the 
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Frequencies of mutations 

Fig. 5. (a) The VAF data for AMLl tumor sample. The VAFs are dense around ~47. (b) The e-BTPs for AMLl from Ding et al. (2012) obtained from 
Rec-BTP using SciClone clusters, (c) The two possible e-BTPs derived from the clonal frequencies and ancestral reconstruction proposed in Ding et al. 
(2012). Auxihary nodes are shown in white circles 



Table 2. Comparison of subpopulation frequencies obtained by Rec-BTP 
with two possible subpopulation frequency multisets obtained from Ding 
et al. (2012), based on the calculation of £1 and 12 norms to the original 
VAF data 



Metric 


^Rec-BTP 


Cx 


C2 


Average l\ norm 


1.6367 


3.3309 


1.8376 


Average €2 norm 


0.1161 


0.2221 


0.1331 



Rec-BTP gives a better fit to the VAF data than both estimates 
C\ and £2 given in Ding et al. (2012). This shows that using the 
tree constraint in the BTP provides additional information that is 
useful for clustering real VAF data. 

Note that calculation of Ix and £2 norms for the SciClone 
cluster means would result in 2.4816 and 0.1823, respectively. 
However, because the number of SciClone clusters is only 4, 
these numbers cannot be reasonably compared with the ones 
calculated for the trees in Figure 5. 

7 DISCUSSION 

In this article, we provide the first rigorous combinatorial for- 
mulation of the problem of inferring the composition of tumor 
subpopulations constrained by a tree in a single tumor sample 
from the variant allele frequencies (VAFs) of somatic mutations. 
In our formulation, we introduced a novel definition of the BTP 
and the e-BTP. We showed that the problem of finding a BTP 
(and hence an e-BTP) is in general NP-complete; however, we 
derived an approximation algorithm for a related problem, max - 
BTP. We developed a recursive algorithm for the e-BTP that 
works well in practice and showed the advantages on this algo- 
rithm on simulated and real sequencing data. 

These results show the utility of the BTP, but also suggest 
additional areas of further investigation. In particular, it would 
be interesting to combine the clustering of VAFs and the con- 
struction of the BTP into a single model. While there has been 
some work to combine these two steps using a machine learning 
approach (Jiao et al, 2014), the complexity of the corresponding 
inference problem is unknown. In our tests, we were unable to 
obtain satisfactory results using this model, suggesting there is 
room for additional improvements. One possible direction is to 
use MCMC or other samphng approaches over the space of 
BTPs, perhaps combining this inference into a graphical model 



that better models the features of real sequencing data [e.g. as 
used in pyClone (Shah et al, 2012; Roth et al, 2014)]. Finally, 
the extension of the BTP and related approaches to multiple 
samples from the same tumor (taken at the same or different 
times) will be increasingly useful as such data become available. 
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